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Abstract 

Many  multivariate  data  analysis  techniques  for  an  m  x  n  matrix  Y  are  related  to  the  model 
Y  =  M  +  E,  where  Y  is  an  m  x  n  matrix  of  full  rank  and  M  is  an  unobserved  mean  matrix 
of  rank  K  <  (to  A  n).  Typically  the  rank  of  M  is  estimated  in  a  heuristic  way  and  then  the 
least-squares  estimate  of  M  is  obtained  via  the  singular  value  decomposition  of  Y,  yielding  an 
estimate  that  can  have  a  very  high  variance.  In  this  paper  we  suggest  a  model-based  alternative 
to  the  above  approach  by  providing  prior  distributions  and  posterior  estimation  for  the  rank  of 
M  and  the  components  of  its  singular  value  decomposition. 

Some  key  words:  Carlson’s  hypergeometric  function,  directional  data,  factor  analysis,  interac¬ 
tion,  model  selection,  relational  data,  social  network,  Steifel  manifold. 


1  Introduction 

Every  to  x  n  matrix  M  has  a  representation  of  the  form  M  =  UDV7  where,  in  the  case  m  >  n, 

•  U  is  an  to  x  n  matrix  with  orthonormal  columns; 

•  V  is  an  n  x  n  matrix  with  orthonormal  columns; 

•  D  is  an  n  x  n  diagonal  matrix,  with  diagonal  elements  [d\ , . . . ,  dn}  typically  taken  to  be  a 
decreasing  sequence  of  non-negative  numbers. 
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The  triple  {U,D,V}  is  called  the  singular  value  decomposition  of  M.  The  squared  elements 
of  the  diagonal  of  D  are  the  eigenvalues  of  MrM  and  the  columns  of  V  are  the  corresponding 
eigenvectors.  The  matrix  U  can  be  obtained  from  the  first  n  eigenvectors  of  MM7.  The  number  of 
non-zero  elements  of  D  is  the  rank  of  M. 

Many  data  analysis  procedures  for  matrix-valued  data  Y  are  related  to  the  singular  value 
decomposition,  partly  due  to  its  appealing  interpretation  as  a  multiplicative  model  based  on  row 
and  column  factors.  Given  a  model  of  the  form  Y  =  M  +  E,  the  elements  of  Y  can  be  written 
yl  j  =  u'Dvj  +  ejj,  where  Uj  and  v;  are  the  ith  and  jth  rows  of  U  and  V  respectively.  Models 
of  this  type  play  a  role  in  the  analysis  of  relational  data  (Harshman  et  al.,  1982),  biplots  (Gabriel 
1971,  Gower  and  Hand  1996)  and  in  reduced-rank  interaction  models  for  factorial  designs  (Gabriel 
1978,  1998).  The  model  is  also  closely  related  to  factor  analysis,  where  the  row  vectors  of  Y  are 
modeled  as  i.i.d.  samples  from  the  model  y,;  =  UjDV7  +  e*.  In  this  situation,  the  goal  is  typically 
to  represent  the  covariance  across  the  n  columns  by  their  relationship  to  K  <  n  unobserved  latent 
factors. 

The  singular  value  decomposition  also  plays  a  role  in  parameter  estimation  for  the  above  model: 
Assuming  the  rank  of  the  mean  matrix  M  is  K  <  n  and  letting  (U,D,V)  be  the  singular  value 
decomposition  of  the  data  matrix  Y,  the  least-squares  estimate  of  M  (and  maximum  likelihood 
estimate  under  Gaussian  noise)  is  given  by  M/^-  =  1;A-j,  where  U[,i;/c]  denotes 

the  first  K  columns  of  U  and  ^\\:k,\:K]  denotes  the  first  K  rows  and  columns  of  D  (Householder 
and  Young  1938,  Gabriel  1978).  In  applications  such  as  signal  processing,  image  analysis,  and  more 
recently  large-scale  gene  expression  data,  representing  a  noisy  data  matrix  Y  by  M/^-  with  K  «  n 
has  the  effect  of  capturing  the  main  patterns  of  Y  while  eliminating  much  of  the  noise. 

Despite  its  utility  and  simplicity,  two  issues  limit  the  use  of  the  singular  value  decomposition  as 
an  estimation  procedure.  The  first  is  that  the  rank  K  of  the  approximating  mean  matrix  M k  must 
be  specified.  Standard  practice  is  to  plot  the  singular  values  of  Y  in  decreasing  order  and  then 
select  K  to  be  the  index  where  the  last  “large  gap”  occurs.  The  second  issue  is  that,  even  if  the 
rank  is  chosen  correctly,  the  least-squares  estimate  has  a  very  high  variance:  The  value  of  |  |M/<-  1 12  is 
equal  to  the  sum  of  the  first  K  eigenvalues  of  Y  Y,  which  has  expectation  E(Y'Y)  =  M,M  +  m<72I 
(where  a2  is  the  variance  of  the  elements  of  E).  As  a  result,  the  entries  of  M#  can  be  much  larger 
in  magnitude  than  the  corresponding  entries  in  M. 

Philosophical  debates  aside,  Bayesian  methods  often  provide  sensible  procedures  for  model  se¬ 
lection  and  high-dimensional  parameter  estimation.  For  the  model  described  above,  a  Bayesian 
procedure  would  provide  a  mapping  from  a  prior  distribution  p(U,D,V,cr2)  to  a  posterior  dis¬ 
tribution  p(U,  D,  V,  cx2|  Y).  Of  primary  interest  might  be  functions  of  this  posterior  distribution, 
such  as  E(M|Y)  or  the  marginal  posterior  distribution  of  the  rank  p(K |Y)  oc  p(K)p(Y\K).  Both 
of  these  quantities  require  integration  over  the  complicated,  high-dinrensional  space  of  {U,D,V} 
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for  each  value  of  K .  In  the  related  factor  analysis  model  where  the  elements  of  U  are  modeled 
as  independent  normal  random  variables,  the  difficulty  in  calculating  marginal  probabilities  has 
led  to  the  development  of  approximate  Bayesian  procedures:  Rajan  and  Rayner  (1997)  provide 
a  coarse  approximation  to  the  marginal  probability  p(Y\K)  by  plugging  in  maximum- likelihood 
estimates.  Minka  (2000)  improves  on  this  by  providing  a  Laplace  approximation  to  the  desired 
marginal  probability.  Both  of  these  procedures  rely  on  asymptotic  approximations,  and  do  not  pro¬ 
vide  Bayesian  estimates  of  M  once  the  dimension  has  been  selected.  In  contrast,  Lopes  and  West 
(2004)  provide  a  unified  procedure  that  provides  both  model  selection  and  parameter  estimation 
for  the  factor  analysis  model,  although  their  approach  to  model  selection  requires  a  complicated 
two-stage  reversible-jump  MCMC  algorithm:  The  first  stage  runs  separate  MCMC  algorithms  for 
each  rank  K  to  be  considered,  and  the  second  stage  runs  a  Markov  chain  between  ranks,  using 
results  of  the  first  stage  to  approximate  marginal  probabilities. 

In  many  situations  the  row  heterogeneity  and  column  heterogeneity  of  Y  are  of  equal  interest. 
In  these  cases,  the  factor  analysis  approaches  mentioned  above  may  be  less  appropriate  than  a 
model  for  the  singular  value  decomposition  of  M.  The  goal  of  this  paper  is  to  provide  a  method 
of  estimation  and  inference  for  such  a  model.  Specifically,  this  paper  provides  the  necessary  calcu¬ 
lations  for  Bayesian  estimation  and  model  averaging  for  a  mean  matrix  M  by  way  of  its  singular 
value  decomposition  {U,D,  V}.  In  Section  2  we  discuss  prior  distributions  for  {U,D,  V}  given  a 
fixed  rank  K ,  and  show  how  the  uniform  distribution  for  U  (the  invariant  measure  on  the  Steifel 
manifold)  may  be  specified  in  terms  of  the  full  conditional  distributions  of  its  column  vectors.  Sec¬ 
tion  3  presents  a  Gibbs  sampling  scheme  for  parameter  estimation  when  the  rank  of  M  is  specified. 
In  the  case  of  unspecified  rank,  estimation  can  be  achieved  via  a  prior  distribution  which  allows 
the  diagonal  elements  of  D  to  each  be  zero  with  non-zero  probability.  A  Markov  chain  Monte 
Carlo  algorithm  which  moves  between  models  with  different  ranks  is  constructed  via  a  Gibbs  sam¬ 
pling  scheme  which  samples  each  singular  value  dj  from  its  conditional  distribution.  This  is  done 
marginally  over  Uu-i  and  Vu,  and  requires  a  complicated  but  manageable  integration.  Section  5 
presents  a  small  simulation  study  that  examines  the  sampling  properties  of  the  Bayesian  procedure. 
It  is  shown  that  the  procedure  is  able  to  estimate  the  true  rank  of  M  reasonably  well  for  a  variety 
of  matrix  sizes,  and  the  squared  error  of  the  Bayes  estimate  £'(M|  Y)  is  typically  much  lower  than 
that  of  the  least  squares  estimator.  Model  extensions  for  non-normal  data  are  described  in  Section 
6,  along  with  an  example  analysis  of  binary  relational  data.  A  discussion  follows  in  Section  7. 

2  The  SVD  model  and  prior  distributions 

As  described  above,  our  model  for  an  mxn  data  matrix  is  Y  =  M  +  E,  where  M  is  a  rank  K  matrix 
and  E  is  a  matrix  of  i.i.d.  mean-zero  normally-distributed  noise.  We  induce  a  prior  distribution  on 
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the  matrix  M  by  way  of  a  prior  distribution  on  the  components  of  its  singular  value  decomposition 
{U,  D,  V}. 

For  a  given  rank  it ,  we  can  take  U  to  be  an  m  x  K  orthonormal  matrix.  The  set  of  such  matrices 
is  called  the  Steifel  manifold  and  is  denoted  Vk,tu ■  A  natural,  non-informative  prior  distribution 
for  U  is  the  uniform  distribution  on  which  is  the  unique  probability  measure  on  V/ym  that  is 

invariant  under  left  and  right  orthogonal  transformations.  As  discussed  in  Chikuse  (2003,  Section 
2.5),  a  sample  U  from  the  uniform  distribution  on  the  Steifel  manifold  Vk,tu  nray  be  obtained 
by  first  sampling  an  m  x  K  matrix  X  of  independent  standard  normal  random  variables  and  then 
setting  U  =  X(X/X)-1/2.  Although  this  construction  is  straightforward,  it  doesn’t  explicitly  specify 
conditional  distributions  of  the  form  p(U[ j]  |TJ [5_yi ) ,  which  are  quantities  that  will  be  required  for 
the  estimation  procedure  outlined  in  Section  3.  We  now  derive  these  conditional  distributions  via 
a  new  iterative  method  of  generating  samples  from  the  uniform  distribution  on  Vjym. 

Let  Ur  41  denote  the  columns  of  U  corresponding  to  a  subset  of  column  labels  A  C  {1 , ,K}, 
and  let  be  any  m  x  (m  —  |A|)  matrix  whose  columns  form  an  orthonormal  basis  for  the  null 
space  of  U[ ^i.  A  random  U  £  Va :,m  can  be  constructed  as  follows: 

1.  Sample  ui  uniformly  from  the  m-sphere  and  set  Umi  =  ui; 

2.  Sample  112  uniformly  from  the  (rn  —  l)-sphere  and  set  U[j2]  = 


K.  Sample  u^-  uniformly  from  the  ( m  —  K  +  l)-sphere  and  set  Ur  x]  =  N{i,...,a'-i}uA'- 

By  construction  this  procedure  generates  an  m  x  K  matrix  U  having  orthonormal  columns.  The 
following  result  also  holds: 

Proposition  1  The  probability  distribution  of  U  is  the  uniform  probability  measure  on  VK,m- 

A  proof  is  provided  in  the  Appendix.  Since  this  probability  distribution  is  invariant  under  left  and 
right  orthogonal  transformations  of  U  (see,  for  example,  Chikuse  2003),  it  follows  that  the  rows 
and  columns  of  U  are  exchangeable.  As  a  result,  the  conditional  distribution  of  Urji  given  any 
subset  A  of  columns  of  U  is  equal  to  the  distribution  of  N/iUj,  where  u7  is  uniformly  distributed 
on  the  (m  —  |A|)-sphere.  This  fact  facilitates  the  Gibbs  sampling  of  the  columns  of  U  and  V  from 
their  full  conditional  distributions,  as  described  in  Section  3. 

For  a  given  rank  K,  the  non-zero  singular  values  {d±, . . . ,  dx}  which  make  up  the  diagonal 
of  D  determine  the  magnitude  of  the  mean  matrix,  in  that  ||M||2  =  J2k=i  ^k-  We  model  these 
non-zero  values  as  being  samples  from  a  normal  population  with  mean  p,  and  precision  (inverse- 
variance)  if.  Conjugate  prior  distributions  for  these  parameters  include  a  normal  distribution  with 
mean  po  and  variance  Vq  for  p ,  and  a  gamma(?]o/2,  pqTq/2)  distribution  for  -i/g  parameterized  so 
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Figure  1:  A  graphical  representation  of  the  model 

that  ip  has  expectation  I/Vq.  This  parameterization  of  the  singular  values  differs  slightly  from 
that  of  the  usual  singular  value  decomposition,  in  that  the  values  {d\ , . . .  ,dx}  are  not  restricted 
to  be  non-negative  here.  A  model  enforcing  this  restriction  is  possible,  but  adds  a  small  amount 
of  computational  difficulty  without  any  modeling  benefit  (if  A  is  a  diagonal  matrix  of  ±l’s,  then 
p(Y|U,  D,  V)  =  p(Y|UA,  AD,  V)).  Finally,  the  elements  of  E  are  modeled  as  i.i.d.  normal  random 
variables  with  mean  zero  and  variance  1  /cp.  The  prior  distribution  for  the  precision  (p  is  taken  to  be 
gamma(z/o/2,  vqGq/2).  A  graphical  representation  of  the  model  and  parameters  is  given  in  Figure 
1.  Choices  for  hyperparameters  {(po,  Vq),  (rjo,  Tq ),  {uq,  Oq)}  are  discussed  in  Section  5. 

3  Gibbs  sampling  for  the  fixed-rank  model 

A  Markov  chain  with  p( U,  D,  V,  cp,  p,  ip\ Y,  K )  as  its  stationary  distribution  can  be  constructed  via 
a  Gibbs  sampling  procedure,  which  iteratively  samples  <p,  p,  ip  and  the  columns  of  U,  D  and  V  from 
their  full  conditional  distributions.  These  samples  can  be  used  to  approximate  the  joint  posterior 
distribution  and  estimate  posterior  quantities  of  interest  (see,  for  example,  Tierney  1994). 

The  full  conditional  distributions  for  (p,  p,  ip  and  the  elements  of  D  are  standard  and  are  provided 
below  without  derivation.  Less  standard  are  the  full  conditional  distributions  of  the  columns  of 
U  and  V.  To  derive  these,  consider  the  form  of  p(Y|U,  D,  V,  <p)  as  a  function  of  UrmVrji  and 
dj  =  D [jjy  Letting  E _j  =  Y  —  we  have 

|Y  —  UDV'||2  =  l|E_,  -  djU| 

=  ||E_j||2  -  2djUj,]E_jV[j]  +  ||^U|„v;jl||2 
-  llE-jl|2  ~  2<iiUjj]E.jV[j,  +  dj. 

It  follows  that  p(Y|U,  D,  V,  (p)  can  be  written 

/  A  \  mn/2  i  i 

p(Y|U, D,  V, (p)  =  eW{--(p\\B_j\\2  +  (pdjV[j]E_jVlj]  -  (1) 
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Recall  that  conditional  on  U[  ,-j]>  U[>il  =  N{-  ; ,  Uj.  where  N/_o  is  a  basis  for  the  null  space  of 
columns  of  Ur_ji  and  u j  is  uniform  on  the  m  —  (K  —  l)-sphere.  From  (1),  we  see  that  the  full 
conditional  distribution  of  u j  is  proportional  to  exp{u'-/z},  where  /i  =  0djN/YjE_jV[j].  This  is 
a  von  Mises-Fisher  distribution  on  the  m  —  (K  —  l)-sphere  with  parameter  /u.  A  sample  of  Urji 
from  its  full  conditional  distribution  can  therefore  be  generated  by  sampling  uj  from  the  von  Mises- 
Fisher  distribution  and  then  setting  U[j]  =  N|_? ,  Uj .  The  full  conditional  distribution  of  Vj ;]  is 
derived  similarly.  In  general,  the  von  Mises-Fisher  distribution  on  the  p-sphere  with  parameter 
H  £  has  density  cp(||/i||)  expju'/i}  and  is  denoted  vMF(g),  and  the  uniform  distribution  on  the 
sphere  is  denoted  vMF(O).  The  normalizing  constants  for  these  two  cases  are 

Cp(«)  =  (27r)“p/2-^ - —  for  K  >  0,  Cp(0)  =  for  k  =  0, 

Ip/  2-i(«)  27rP/i 

where  Iu(x)  is  the  modified  Bessel  function  of  the  first  kind.  R-code  for  sampling  from  this  distri¬ 
bution  is  provided  at  my  website. 

Summarizing  these  results,  a  Markov  chain  with  the  desired  stationary  distribution  can  be 
constructed  by  iterating  the  following  procedure: 

•  For  j  G  {l,...,/\}, 

-  sample  (U^Y,  U^j,  D,  V,  <j>)  =  N where  uj  ~  vMF(^N“bE_jV[j1); 

-  sample  (V yj  |Y,  U,  D,  V [-j],4>)  =  where  Vj  ~  vMF^-U^E^N^); 

-  sample  (dj\ Y,  U,  D[_j  V,  (f>,  p,  ip)  ~  normal  [(U[J.]E_iV[)i]</)+pV;)/(^+V;),  l/(0+V’)]; 

•  sample  (0|Y, U, D,  V)  ~  gamma[(zzo  +  mn)/ 2,  (vo&o  +  ||Y  —  UDV/||2)/2]; 

•  sample  (p|D,  % b)  ~  normal[(V’  I]  dj  +  hq/vD/^K  +  1/uq),  1  /{i>K  +  1  /vq)\ 

•  sample  (ip\D ,p)  ~  gamma[(??o  +  K)/ 2,  (??0 Tq  +  XX  4?  ~  A^)2)/2]; 

4  The  variable-rank  model 

4.1  Prior  distributions 

In  this  section  we  extend  the  model  of  Section  2  to  the  case  where  the  rank  K  is  to  be  estimated. 
This  requires  comparisons  between  models  with  parameter  spaces  of  different  dimension.  Two 
standard  ways  of  viewing  such  problems  are  as  follows: 

•  Conceptualize  a  different  parameter  space  for  each  value  of  K,  i.e.,  conditional  on  K,  the 

mean  matrix  is  UDV7  where  the  dimensions  of  U,  D  and  V  are  m  x  K,  K  x  K  and  n  x  K 

respectively. 
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•  Parameterize  U,  D  and  V  to  be  of  dimensions  mxn,  nx  n  and  n  x  n,  but  allow  for  columns 
of  these  matrices  to  be  identically  zero.  In  this  parameterization,  K  =  Y'j=i  1  (dj  /  0). 

Each  of  these  two  approaches  has  its  own  notational  and  conceptual  hurdles,  and  which  one  to 
present  is  to  some  extent  a  matter  of  style  (see  Green  2003  for  a  discussion).  Given  a  prior 
distribution  on  K ,  the  first  approach  can  be  formulated  by  using  the  prior  distributions  of  Section 
2  as  the  conditional  distributions  of  U,  D  and  V  given  K.  The  second  approach  can  be  made 
equivalent  to  the  first  as  follows: 

1.  Let  U,  D,  V  have  the  prior  distributions  described  in  Section  2  with  K  =  n; 

2.  Let  {si, . . . ,  sn}  ~  p(K  =  Y  Sj )  x  \  where  each  sj  G  {0, 1}; 

3.  Let  S  =  diagjsi, . . . ,  sn}.  Set  U  =  US,D  =  DS,V  =  VS,  K  =  Ysj- 

Parameterizing  a  set  of  nested  models  with  binary  variables  has  been  a  useful  technique  in  a  variety 
of  contexts,  including  variable  selection  in  regression  models  (Mitchell  and  Beauchamp  1988).  We 
continue  with  this  formulation  because  it  allows  for  the  construction  of  a  relatively  straightforward 
Gibbs  sampling  scheme  to  generate  samples  from  the  posterior  distribution. 

The  matrices  U,D  and  V  described  in  1,  2  and  3  above  are  exchangeable  under  simultaneous 
permutation  of  their  columns.  It  follows  from  Proposition  1  that,  conditional  on  si,...,sn,  the 
non-zero  columns  of  U  and  V  are  random  samples  from  the  uniform  distributions  on  s:j  ,m  and 
Vj2sj’n  resPectively)  and  that  conditional  on  {sj  =  1,  U^],  V [  _,,•]}, 

U[b]  =  V[J]  ±  N|_j}v,  where 

•  N{-*}  and  Nf  are  orthonormal  bases  for  the  null  spaces  of  U[-j]  and  V[ 

•  u  and  v  are  uniformly  distributed  on  the  (m  —  Y  sj  +  1)-  and  (n  —  Y  sj  +  l)-spheres. 

This  property  will  facilitate  posterior  sampling  of  the  columns  of  U,  D  and  V,  as  described  in  the 
next  subsection. 

4.2  Posterior  estimation 

Let  ©  =  {U,D,  V},  0j  =  {Urj|,c?j,  Vr  u}  and  ©_j  =  {©^  :  k  /  j)  .  In  this  subsection  we  derive 
the  full  conditional  distribution  of  Qj  given  {Y,  ©_•/,  0,  /U,  ip}  under  the  model  described  in  the 
previous  subsection.  The  prior  and  full  conditional  distributions  of  0,  p  and  ip  remain  unchanged 
from  Section  2.  The  full  conditional  distributions  can  be  used  in  a  Gibbs  sampling  scheme  to 
generate  approximate  samples  from  p(U,  D,  V,  (f>,  A|  Y). 

Under  the  model  and  parameterization  described  above,  the  components  of  ©y-  are  either  all 
zero  or  have  a  distribution  as  described  in  Section  2.  To  sample  Qj,  we  first  sample  whether  or 
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not  the  components  are  zero,  and  if  not,  sample  the  non-zero  values.  More  specifically,  sampling 
© j  from  its  full  conditional  distribution  can  be  achieved  as  follows: 

1.  Sample  from  ({dj  =  0},  { dj  /  0})  conditional  on  Y,  ©-,,  4>,  Hi  ■ 


2.  If  {dj  =  0}  is  true,  then  set  dj,  Umi  and  Vr^-i  all  equal  to  zero. 

3.  If  {dj  /  0}  is  true, 


(a)  sample  dj |Y,  &  j,  4>,  H ,  i/),  {dj  /  0}; 

(b)  sample  {U^j,  V[j]}|Y,  ©_,,  (/),  dj. 


The  steps  1,  2,  and  3  above  constitute  a  draw  from  p(©,|Y,  ©_,,  <f>,  Hi  ip).  The  first  step  requires 
calculation  of  the  odds: 


odds(dj  jd  0|  Y,  &-j,  Hi  V>) 


p{dj  /  O|0-j)  x  p{Y\&-j,dj  /  0 
P(dj  =  01©-,)  p(Y\@-j,dj  =  0,  <f>,  Hi  if>) 


(2) 


The  first  ratio  is  simply  the  prior  conditional  odds  of  {dj  ^  0}  and  can  be  derived  from  the  prior 
distribution  on  the  rank  K.  The  second  term  in  (2)  can  be  viewed  as  a  Bayes  factor,  evaluating  the 
evidence  in  the  data  for  additional  structure  in  E[Y]  beyond  that  provided  by  ©_,.  Recall  from 
the  previous  section  that  Y  —  UDV'  =  E_j  —  rij U[  y]Vj  ,  and  so  we  can  write 


P(Y|U,  D,  V,(/),h,  ip) 


(  0  \mn/2 

\2vr  J 


p(Y|0-j,  dj 


exP{-^llE-,l|2} 

=  0,  (j),  H,  VO  x  exp{ 


exP{ exP{ <HU [}j\ E-j V y] } 

-^dj}  x  expl^djUjjjE-jVjj]}  (3) 


The  first  term  in  (3)  is  equal  to  the  denominator  of  the  Bayes  factor,  and  is  simply  a  product 
of  normal  densities  with  the  elements  of  Y  having  means  given  by  Uj  Vj  ^  and  equal 

variances  1  /</>.  The  numerator  of  the  Bayes  factor  can  be  obtained  by  integrating  (3)  over  Qj  with 
respect  to  its  conditional  distribution  given  /x,  ip,  ©_,  and  {dj  /  0}.  Integrating  first  with  respect 
to  Ujj],  V[ j],  we  need  to  calculate  E[exp{(f)djXJ{  ^E_jV[j]}|©_,-,  dj].  Let  m  =  m  —  Ylk^j{dk  /  0} 
and  h  =  n  —  Ylk^j{dk  /  0}.  Recall  that  conditional  on  ©-,-,  Uyj  =  N|  -,}u  ^d  ^  N|_ -}v 
where  u  and  v  are  uniformly  distributed  on  the  m-  and  h-spheres.  Letting  E  =  N|Y}E_,lNT_ji , 
the  required  expectation  can  therefore  be  rewritten  as  EuvfexpjVxiju'Ev}].  This  expectation  is 
non-standard,  and  is  derived  in  the  appendix.  The  result  gives: 


PC Y| ©-, ,(f>,H,ip,dj)  =  p{ Y| 0_, ,(/>,H,il>,  dj 


0)  x  exp{ 


1 

2 


V»d2}^||E||2V2/dfa« 

1=0 


(4) 


where  the  sequence  {az}o°  can  be  computed  exactly  and  is  given  in  the  appendix. 


The  calculation  of  p(Y|0_j,  0,  p,  ip,  dj  /  0)  is  completed  by  integrating  (4)  over  dj  with  respect 
to  p(dj|0_j,  </>,  ip,  dj  /  0),  the  normal  density  with  mean  p  and  precision  ip.  This  integration 
simply  requires  calculating  the  even  moments  of  a  normal  distribution,  resulting  in 

OO 

P(Y\ +  0)  =  p(Y\&-j,(p,ip,dj  =  0)  x  l|E|| 2latbi  (5) 

1=0 

where  the  sequence  {6;}g°  is  given  by 

bi = *2‘  {ih) '  eM-l^^+^}El{v^{z+jf^)}2,] 

where  Z  is  standard  normal.  The  required  moments  can  be  calculated  iteratively,  see  for  example 
Smith  (1995).  The  conditional  odds  of  {dj  /  0}  is  therefore 

odds(dj  /0|Y,  ©_„</>,  A)  =  Pp[^tl\tlJj)  x  jtm2laibi. 

In  practice,  only  a  finite  number  of  terms  can  be  used  to  compute  the  above  sums.  The  sum  in 
(4)  can  be  bounded  above  and  below  by  modified  Bessel  functions,  and  the  error  in  a  finite-sum 
approximation  can  be  bounded,  at  least  to  the  extent  that  one  can  compute  the  bounding  Bessel 
functions.  This  can  also  provide  a  guide  as  to  how  many  terms  to  include  in  approximating  (5). 
Details  are  given  in  the  Appendix. 

If  {dj  /  0}  is  sampled  it  is  still  necessary  to  sample  dj,\Jyi  and  Vu.  Multiplying  equation 
(4)  by  the  prior  for  dj\{dj  /  0},  the  required  conditional  distribution  for  dj  is  proportional  to 

OO 

p{dj\Y,&^j,<P,n,iP,{dj  +  0})  oc  e-^-^e-H2(^||E| \2l(P2ldfai 

1=0 

which  is  an  infinite  mixture  with  the  following  components: 

•  mixture  weights:  wi  oc  ||E||2ia;6/ 

•  mixture  densities:  fi(d)  oc  d21  exp{  —  ^{d  —  p)2ip},  where  p  =  pip/{(p  +  ip)  and  ip  =  <p  +  ip 

The  density  fi(d)  is  nonstandard,  but  can  be  sampled  from  quite  efficiently  using  rejection  sam¬ 
pling  with  a  scaled  and  shifted  f-distribution  as  the  approximating  density  (the  tails  of  a  normal 
distribution  are  not  heavy  enough). 

To  sample  Urji  and  Vrji  we  first  sample  u  and  v  from  their  joint  distribution  and  then  set 
U[j]  =  Nj_? ,  u  and  =  NV  .jV.  Equation  (3)  indicates  that  the  joint  conditional  density  of 
{u,  v}  is  of  the  form 

p(u,  v|Y ,  &~j,  <p,  p,  ip,  dj)  =  c(A)exp{u'Av},  (6) 

where  A  =  cpdj'E  and  c(A)-1  =  Cm(0)_1Cn(0)^1  ||A||aaj.  This  density  defines  a  joint  distri¬ 

bution  for  two  dependent  unit  vectors.  To  my  knowledge,  such  a  joint  distribution  has  not  been 
studied  before.  Some  useful  facts  about  this  distribution  are 
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•  the  conditional  distribution  of  u|v  is  vMF(Av),  and  that  of  v|u  is  vMF(u'A); 

•  the  marginal  distribution  of  v  is  proportional  to  -Im/2-i(l|Av||)/j|Av||rn/2~1; 

•  the  joint  density  has  local  maxima  at  {±(ufc,  v*,),  k  =  1  ,  ...,n}  where  (u  &,Vfc)  are  the  kill 
singular  vectors  of  A. 

I  have  implemented  a  number  of  rejection  and  importance  samplers  for  this  distribution,  although 
making  these  schemes  efficient  is  still  a  work  in  progress.  A  relatively  fast  approximate  method 
that  seems  to  work  well  for  a  variety  of  matrices  A  is  to  first  sample  (u,  v)  from  the  local  modes 
{±(ufc,  Vfc),  A:  =  1  according  to  the  exact  relative  probabilities  and  then  use  this  value  as 

a  starting  point  for  a  small  number  of  Gibbs  samples,  alternately  sampling  from  p(u|A,v)  and 
p(v|  A,  u). 

The  complexity  of  the  calculations  involved  in  sampling  from  p(& j\ Y,  ©  ;,  <j),  p,  ip)  suggest  we 
look  for  a  simpler  procedure.  For  example,  we  could  model  only  dj  to  be  zero  with  non-zero 
probability,  and  sample  from  its  full  conditional  distribution  instead  of  marginally  over  Ur  ji  and 
V r jj .  Unfortunately,  an  algorithm  based  on  this  approach  will  not  mix  well  across  ranks  of  M 
because  dj,  Um  and  Vm  are  dependent  to  an  extreme:  The  probability  of  sampling  dj  /  0  is 
essentially  zero  unless  Urji  and  Vu  are  near  a  pair  of  local  modes,  but  the  probability  of  Umi 
and  V[  jj  being  in  such  a  state  is  essentially  zero  if  dj  =  0.  Metropolis-Hastings  algorithms  are 
problematic  for  similar  reasons  and,  based  on  my  initial  efforts  on  this  problem,  such  algorithms 
seem  to  require  an  extreme  amount  of  tuning  of  the  proposal  distributions  to  achieve  even  minimal 
acceptance  rates.  In  contrast,  sampling  dj  marginal  over  Umi  and  V r?i  is  possible  as  shown  above, 
requires  no  tuning  and,  for  the  examples  in  this  article,  mixes  well  across  matrices  M  of  different 
ranks. 

4.3  A  suggested  Gibbs  sampling  scheme 

The  dimension-changing  Monte  Carlo  sampler  outlined  above  is  computationally  expensive  com¬ 
pared  to  the  fixed-dimension  sampler  of  Section  3.  For  this  reason,  it  may  be  desirable  to  incorporate 
the  fixed-dimension  sampler  even  when  we  are  interested  in  sampling  across  dimensions,  as  this 
might  improve  within-dimension  mixing  at  a  low  computational  cost.  One  such  algorithm  proceeds 
by  iterating  the  following  steps: 

A.  Variable  dimension  sampler:  For  each  j  €  {1, . . . ,  n},  sample  0j  =  {Urji,  dj,  Vyi}  via 

•  sampling  dj\Y,  @-j,  ; 

•  sampling  (U^-j,  V[,j])|Y,  (/>,  n,  ip,  dj. 

B.  Fixed  dimension  sampler:  For  each  {j  :  dj  0}, 
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•  sample  Urji  |Y,  &-j,  (/),  n,  ijj,  dj,  Vr  m 

•  sample  Vrji  |Y,  @_j,  (/>,  n,  ip,  dj,  Umi  ; 

•  sample  dj\Y,  @-j,  <p,  n,  ip,  Un,  Vu. 

C.  Other  terms: 

•  sample  <p\  Y,  0  ; 

•  sample  /x|D,  ip\ 

•  sample  p|D,  //. 

Alternatively,  steps  A  and  B  could  be  performed  on  random  subsets  of  indices  j.  The  distributions 
required  for  the  steps  in  A  are  outlined  in  this  section,  and  steps  B  and  C  are  outlined  in  the 
previous  section.  By  conditioning  on  whether  or  not  dj  =  0  for  each  j,  the  steps  in  B  can  be 
viewed  as  Gibbs  sampling  for  all  j  £  {l,...,n},  not  just  those  for  which  dj  /  0.  R-code  that 
implements  this  routine  is  available  at  my  website. 

5  Simulation  study 

In  this  section  we  examine  the  sampling  properties  of  the  estimation  procedure  with  a  small  simu¬ 
lation  study.  Each  dataset  in  this  study  was  simulated  from  the  following  model: 

•  U  ~  uniforrn(V5)m),  V  ~  uniform(V.5,n)  ; 

•  D  =  diag{di, . . . ,  d5},  {di, . . . ,  d5}  ~  i.i.d.  uniform (\nmn,  |/w). 

•  Y  =  UDV;  +  E,  where  E  is  an  m  x  n  matrix  of  standard  normal  noise. 

For  each  value  of  m  and  n,  the  sampling  mean  of  {cii, . . . ,  d§}  was  taken  to  be  /jmn  =  y/n  +  m  +  2 sjrvm. 
Such  a  value  should  distribute  the  singular  values  {d\, . . . ,  d§}  near  the  “cusp”  of  detectability:  As 
shown  in  Edelman  (1988),  the  largest  singular  value  of  an  m  x  n  matrix  E  of  standard  normal  noise 
is  approximately  jimn  for  large  m  and  n. 

Three-hundred  datasets  were  generated  using  the  model  above,  one-hundred  for  each  of  the  three 
sample  sizes  (m,n)  £  {(10, 10),  (100, 10),  (100, 100)}.  These  were  generated  in  the  R  statistical 
computing  environment  using  the  integers  1  through  100  as  random  seeds  for  each  of  the  three 
sample  sizes.  Code  to  generate  these  datasets  is  available  from  my  website.  Prior  distributions 
for  the  parameters  {</>,  /i,  if)}  were  taken  as  described  above  with  “prior  sample  sizes”  of  r'o  =  2 
and  r/o  =  2.  This  gives  exponential  prior  distributions  for  4>  and  V;-  The  values  of  <7q ,  p-o  and  Tg 
were  derived  from  “empirical  Bayes” -type  estimates  obtained  by  averaging  over  different  ranks  as 
follows: 
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1.  For  each  k  £  {0, . . . ,  n}, 

(a)  Let  UDV  be  the  least-squares  projection  of  Y  onto  the  set  of  rank-/c  matrices; 

(b)  Let  a\  =  ||Y  —  UDV  ||2/(nm) 

(c)  Let  Afe  =  £*=  1  dj/k,  tI  =  E*_i(d,-  -  d)2/fc. 

2.  Let  a2  =  ^  £”=0  <^o  =  ^  E”=o  A;,  *§  =  £  E”=o(A,  -  A)2,  r2  =  ^  E”=o  *f- 


The  resulting  prior  distributions  are  weakly  centered  around  averages  of  empirical  estimates,  where 
the  averaging  is  over  ranks  0  through  n.  Finally,  the  prior  distribution  on  the  rank  K  of  the  mean 
matrix  was  taken  to  be  uniform  on  {0, . . .  ,n}.  Other  simple  priors  that  gave  similar  results  for 
this  simulation  study  include  diffuse  mean-zero  normal  distributions  for  the  dj' s  (used  in  the  next 
section),  and  one  in  which  the  conditional  mean  of  the  dj' s  given  <p  is  cp^1^2 yjm  +  n  +  2 \Jmn.  This 
latter  prior  distribution  essentially  focuses  the  search  for  non-zero  dj ’s  to  values  that  are  as  large 
as  the  largest  singular  values  of  normally  distributed  noise  matrices,  and  will  generally  result  in  a 
posterior  distribution  that  puts  more  mass  on  lower  values  of  K  than  would  a  prior  distribution 
for  the  dj' s  centered  around  zero.  A  more  complicated  alternative  to  these  approaches  would  be 
to  have  the  prior  distributions  for  {(j),  p,,  ip}  depend  on  K.  For  example,  given  K  =  k,  the  prior 
distributions  for  {<A,  /x,  -0}  could  be  based  on  {<t|,  ff}.  Such  prior  distributions  would  require 
some  minor  modifications  to  the  variable-dimension  sampler  outlined  in  the  previous  section. 

For  each  of  the  100  x  3  datasets,  10,  000  iterations  of  the  Gibbs  sampling  scheme  described  in 
Section  4.3  were  run  to  obtain  approximate  samples  from  the  posterior  distribution  of  UDV7.  All 
Markov  chains  were  begun  with  K  =  0  and  {</>,  p,  ip}  set  equal  to  their  prior  modes.  Summaries  of 
the  posterior  distributions  for  the  three  different  values  of  ( m,n )  are  displayed  in  Figure  2.  The 
first  column  of  each  panel  plots  the  MCMC  approximation  to  the  expected  value  of  p(K |Y)  for 
each  value  of  (m,n).  The  expectation  Ey[p{K |Y)]  is  approximated  by  £g=°1p(-K’|Y(s)),  where 
Y(s)  is  the  sth  simulated  dataset  for  a  given  value  of  (m,n)  (for  the  case  m  =  n  =  100,  p(K |Y)  is 
plotted  only  for  K  <  10,  although  the  distribution  extends  beyond  this  value).  These  distributions 
are  all  peaked  around  the  correct  value  K  =  5  Also  of  interest  is  how  frequently  the  posterior  mode 
K  =  arg  max p( iL |  Y)  obtains  the  true  value  of  K  =  5.  This  information  is  displayed  in  the  second 
column  of  Figure  2,  which  gives  the  empirical  distribution  of  K  taken  over  each  of  the  100  datasets. 
As  we  see,  the  true  value  K  =  5  is  the  most  frequent  value  of  the  estimate  in  each  dataset,  with 
K  =  4  a  close  second.  This  is  not  too  surprising,  as  half  of  the  simulated  singular  values  are  below 
the  rough  detection  threshold  of  \J n  +  m  +  2  sjmn. 


Lastly  we  consider  the  effect  of  shrinkage  on  the  estimate  of  the  mean  matrix  M.  For  each  simulated 
dataset  the  posterior  mean  M  =  if[M|Y]  was  obtained  by  averaging  its  value  over  the  104  scans 
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Figure  2:  Results  of  the  simulation  study.  Plots  in  the  first  column  give  the  averages  of  p(K |Y) 
over  100  simulated  datasets.  The  second  column  gives  the  empirical  distribution  of  the  posterior 
mode  K .  The  third  column  gives  the  distribution  of  the  ratio  of  the  squared  error  of  the  Bayes 
estimate  of  M  to  that  of  the  least-squares  estimate. 
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of  the  Gibbs  sampler.  The  squared  error  in  estimation,  averaged  over  elements  of  the  mean  matrix 
was  calculated  as  ASE#  =  ||M  —  M|| 2/(mn)  where  M  is  the  mean  matrix  that  generated  the 
data.  This  value  is  compared  to  ASE^g,  which  is  the  corresponding  average  squared  error  of  the 
least-squares  projection  of  Y  onto  the  space  of  rank- A"  matrices.  The  distribution  of  this  ratio  is 
mostly  below  1  for  the  case  m  =  n  =  10,  and  strictly  below  1  for  the  other  two  cases  where  there  are 
more  parameters  to  estimate.  This  corresponds  with  our  intuition:  The  model-averaged  estimates 
improve  relative  to  the  least-squares  estimates  as  the  number  of  parameters  increases.  These  results 
indicate  that  simply  obtaining  a  posterior  estimate  K  of  K  and  then  using  the  corresponding  rank- 
K  least-squares  estimate  of  M  generally  results  in  an  estimate  that  can  be  substantially  improved 
upon  by  model  averaging,  at  least  in  terms  of  this  error  criterion. 

6  Extension  and  example:  analysis  of  binary  relational  data 

A  potentially  useful  extension  of  the  model  described  above  is  to  a  class  of  generalized  bilinear 
models  of  the  form 


Oij  =  /3'xjj  +  u'Dvj  +  eij 

E[Vi,j\®\  =  9 

where  g  is  the  link  function.  Such  models  allow  for  the  analysis  of  a  variety  of  data  types: 
For  example,  binary  data  can  be  modeled  as  yig  ~  binary ( \ } )  and  count  data  as  iji.j  ~ 
Poisson(exp{#jj}).  Gabriel  (1998)  considered  maximum  likelihood  estimation  for  a  variant  of  this 
model  in  situations  where  the  dimension  of  D  is  fixed,  and  Hoff  (2005)  considered  a  symmetric 
version  of  this  model  for  the  analysis  of  social  network  data.  Parameter  estimation  and  dimension 
selection  for  the  above  model  can  be  made  by  sampling  from  a  Markov  chain  generated  by  a  mod¬ 
ified  version  of  the  algorithm  of  Section  4.3.  Given  current  values  of  0,/3,  U,D,V,  sample  new 

values  as  follows: 

1.  Let  Y  =  0  —  X/3  =  UDV7  +  E.  Update  U,  D,  and  V  from  their  conditional  distribution 
given  Y  as  described  in  Section  4.3. 

2.  Let  Y  =  0  -  UDV'  =  X/3  +  E.  Update  /3  from  its  conditional  distribution  given  Y  (a 
multivariate  normal  distribution). 

3.  Sample  ©*  =  X/3  +  UDV7  +  E*,  where  E*  is  a  matrix  of  normally  distributed  noise  with 

zero  mean  and  precision  d>.  Replace  6,1  ,■  by  6*  ■  with  probability  A  1. 

We  illustrate  the  use  of  such  a  model  and  estimation  procedure  with  an  analysis  of  binary 
relational  data  between  46  global  service  firms  and  55  cities,  obtained  from  the  Globalization  and 
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World  Cities  study  group  (http://www.lboro.ac.uk/gawc).  For  these  data,  yly  =  1  if  firm  j 
has  an  office  in  city  i  and  yij  =  0  otherwise.  Standard  practice  is  to  represent  within-row  and 
within-column  homogeneity  with  effects  that  are  additive  on  the  log-odds  scale: 

log  odds(j/i  j  =  1)  =  (3  +  m  +  bj,  (7) 

and  so  the  effects  a  =  {ai, . . . ,  am }  and  b  =  {b\, . . . ,  bn}  constitute  a  rank-two  structure.  We  look 
for  evidence  of  higher-order  structure  by  considering  the  model 

log  odds(j/jj  =  1)  =  P  +  jij  (8) 

7 i,j  =  u/Dvj  +  eij 

The  rank-two  structure  of  model  (7)  is  easily  incorporated  into  (8)  by  fixing  =  -^=lmxl  and 
V [  2]  =  ^lnxi  and  modeling  d\  and  d 2  to  be  non-zero  with  probability  1.  The  additive  city  and 
firm  effects  are  then  given  by  a  =  c^U [,2]  and  b  =  diVn]  respectively.  Note  that  any  remaining 
effects  represented  by  UDV7  will  be  orthogonal  to  these  additive  effects,  and  that  the  mean  of 
the  matrix  UDV7  is  identically  zero,  making  it  unaliased  with  the  intercept  (5.  For  the  remainder 
of  this  analysis,  the  variable  K  will  refer  to  the  number  of  additional  non-zero  singular  values  of 
UDV7  beyond  the  additive  row  and  column  effects. 

We  fix  the  error  variance  1/0  =  1 ,  as  this  scaling  parameter  is  confounded  with  the  magnitude 
of  (3  and  UDV7.  For  simplicity  we  use  independent  normal  (0, 100)  prior  distributions  for  (3  and 
the  non-zero  elements  of  D,  and  a  uniform  prior  distribution  for  K.  A  Markov  chain  of  length 
25,000  was  constructed  using  the  algorithm  described  above,  starting  with  K  =  0.  Mixing  across 
ranks  K  was  quite  rapid  as  is  shown  in  the  first  panel  of  Figure  3,  which  displays  values  of  K  every 
100th  scan  of  the  Markov  chain.  The  Monte  Carlo  estimate  of  p(K |Y),  shown  in  the  second  panel, 
gives  a  posterior  mode  of  K  =  6  and  suggests  strong  evidence  for  structure  in  the  log-odds  beyond 
that  of  the  additive  row  and  column  effects. 

One  of  the  practical  motivations  for  selecting  an  appropriate  model  dimension  is  prediction. 
Many  binary  social  network  datasets  include  missing  values,  in  which  it  is  not  known  whether 
Uij  =  1  or  y,L)j  =  0.  In  such  cases  it  is  often  desirable  to  make  predictions  about  missing  values 
based  on  the  observed  data,  and  thus  to  base  model  selection  on  predictive  performance.  With  this 
in  mind,  we  compare  the  above  results  to  the  following  10-fold  cross  validation  procedure: 

1.  Randomly  split  the  set  of  pairs  {*,  j}  into  ten  test  sets  Ai, . . . ,  A\q. 

2.  For  K  =  0, 1, ... ,  Jimax  : 

(a)  For  l  =  1, . . . ,  10  : 

i.  With  the  rank  fixed  at  K,  perform  the  MCMC  algorithm  using  only  {y%^  :  {i,j}  0 
Ai},  but  sample  values  of  9ij  for  all  ordered  pairs. 
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Figure  3:  Posterior  estimation  of  K.  The  first  panel  plots  values  of  K  every  100th  scan  of  the 
Markov  chain.  The  second  panel  plots  the  Monte  Carlo  estimate  of  p(K\Y).  The  third  panel  gives 
the  results  of  a  cross-validation  evaluation  of  K  £  {0, . . . ,  10}. 

ii.  Based  on  the  Monte  Carlo  sample  values  {6^, . . . ,  6}-'^}  compute  the  posterior  mean 
=  ^  Ef=i  ~ ^°r  ^  and  the  log  predictive  probability  lpp(A^)  = 

(b)  Measure  the  predictive  performance  for  I\  as  LPP(/C)  =  lpp(A;). 

The  values  of  —  2LPP(/C)  for  K  £  {0, . . . ,  10}  are  shown  in  the  third  panel  of  Figure  (3).  For 
the  particular  random  partitioning  of  the  data  used  here,  the  cross-validation  procedure  suggests 
a  model  rank  of  K  =  6,  which  is  the  same  value  as  the  posterior  mode  of  the  Bayes  solution. 
However,  a  comparison  of  N  values  of  K  using  a  ten-fold  cross  validation  procedure  requires  the 
construction  of  10  x  N  separate  Markov  chains,  and  further  requires  specification  of  the  values  of 
K  to  be  compared.  In  contrast,  the  Bayesian  procedure  requires  only  one  MCMC  run  and  can 
potentially  visit  each  value  of  K  £  {1, . . . ,  n}. 

Finally  we  examine  some  of  the  patterns  in  the  structure  of  UDV7  beyond  those  of  the  additive 
effects.  The  posterior  mean  of  UDV,  minus  the  additive  effects,  was  obtained  by  averaging  over 
scans  of  the  Markov  chain.  The  first  two  singular  values  and  vectors  of  this  matrix  were  obtained, 
and  the  values  of  the  resulting  row  (city)  effects  are  plotted  in  Figure  4.  These  values  are  strongly 
related  to  geography:  U.S.  cities  cluster  together,  as  do  cities  in  Europe,  Latin  America  and  from 
the  Pacific  rim. 
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Figure  4:  City  specific  effects:  The  first  two  left  singular  vectors  of  E(M\Y)  indicate  strong  geo¬ 
graphic  patterns  in  the  data. 
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7  Discussion 


This  paper  has  presented  a  model-based  data  reduction  and  representation  method  for  multivariate 
and  matrix-valued  data.  The  approach  is  to  model  the  data  matrix  Y  as  equal  to  a  reduced-rank 
mean  matrix  M  plus  Gaussian  noise,  and  to  simultaneously  estimate  M  along  with  its  rank.  The 
approach  is  Bayesian  and  the  estimation  procedure,  based  on  Markov  chain  Monte  Carlo,  allows 
for  a  wide  variety  of  model  extensions,  such  as  to  generalized  bilinear  models  as  described  in  the 
previous  section.  Other  straightforward  extensions  include  estimation  using  replicate  data  matrices 
and  estimation  subject  to  missing  data.  This  latter  extension  may  be  of  particular  use  in  the 
analysis  of  relational  data  among  a  large  number  of  nodes,  where  it  may  be  too  costly  to  make 
observations  on  all  possible  pairs.  In  such  cases,  the  value  of  may  be  missing  for  many  pairs, 
but  one  can  make  predictions  based  on  estimates  u,,  D,  vy  obtained  from  the  observed  data.  Using 
this  approach  to  predict  missing  links  in  social  networks  and  protein-protein  interaction  networks 
is  one  of  my  current  research  areas.  However,  for  large  datasets  with  1000  nodes  (106  observations) 
or  more,  the  MCMC  scheme  in  this  article  becomes  prohibitively  computationally  expensive.  I  am 
currently  studying  methods  of  making  approximate  Bayesian  inference  for  large  relational  datasets. 
These  include  Laplace  approximations  for  various  components  of  the  MCMC  scheme  of  Sections  3 
and  4,  and  using  variational  methods  for  approximating  joint  posterior  distributions  (Jordan  et  al., 
1999). 

Computer  code  and  data  for  all  numerical  results  in  this  paper  are  available  at 

www . stat . Washington . edu/hof f . 

A  Proof  of  proposition  1 

We  first  construct  a  sample  from  the  uniform  distribution  on  V/ym  and  then  show  that  it  has 
the  desired  conditional  distributions.  Let  zi,...,zk  be  i.i.d.  multivariate  normal  (0,ImXm)-  Let 
xi  =  zi  and  for  j  =  1, . . . ,  K  —  1  let 

•  xi  =  (xi  •  •  •  Xj); 

•  Pj=I  -X^X'X^X'; 

•  xi+i  =  f>jzj+ 1- 

Note  that  P j  is  the  symmetric,  idempotent  projection  matrix  of  M*  onto  the  null  space  of  Xj,  and 
so  the  vectors  xi, . . . , Xj+i  are  orthogonal.  For  each  j,  let  Uj  =  Xj(XjXj)-1/2.  For  j  =  K,  we 
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have 


/  x'  \ 


X^XK  = 


Xo 


V x/'  ) 


(Xl  X2 


x/(]  = 


( 

V 


lxir  0 

0  |x2| 
0  0 


0  ^ 
0 

x/d2  ) 


and  so 


U  K 


X2 

X2 


X|f 


Xj{ 


is  a  matrix  of  K  orthonormal  vectors  in  Rm.  The  proof  will  be  complete  if  we  can  show  the 
following: 


Lemma  1:  The  distribution  of  U j<  is  the  uniform  distribution  on  VK,m- 

Lemma  2:  Ur^+1]|Ufc  =  Nkuk+i  where  N/c  is  an  orthonormal  basis  for  the  null  space  of  U*,  and 
Ufc_)_i  is  distributed  uniformly  on  the  rn  —  k  dimensional  sphere. 

Proof  of  Lemma  1.  By  Theorem  2.2.1  (iii)  of  Chikuse  (2003),  an  m  x  K  matrix  of  the  form 
X^(X,ft-Xjf)-1/2  is  uniformly  distributed  on  VK,m  if  ~^K  is  an  rn  x  K  random  matrix  with  rank  K 
a.s.  and  having  a  distribution  that  is  invariant  under  left-orthogonal  transformations.  We  will  show 
left  invariance  for  each  X^.  constructed  above  by  induction.  Let  H  :  Mm  — >  Rm  be  an  orthogonal 
transformation,  and  note  that  HXi  =  Hxi  =  xi  =  Xi.  Now  suppose  HXj  =  X&.  The  distribution 
of  HXfc+i  is  determined  by  its  characteristic  function: 

A;+l  k 

£[exp{zy~V?-Hxj}]  =  E[ex.p{i  ^  t  jHxj}£l[exp{zt/fc+1Hxfc+i}|Xfc]] 

3= 1  i=1 


Note  that  tj(+1Hxfc+i  =  (PfcH'tfc+i/zfc+i,  where  z^+i  is  a  vector  of  independent  standard  normals 
and  independent  of  Xj..  Thus  the  characteristic  function  can  be  rewritten  as 

^[exp{i^t'  Hxi}exp{--t/A,+1HPfcP,A.H'tfc+i}]  =  E[exp{i  t'-Xj}  exp{--t'fc+1Pfctfc+i}]  (9) 
j= 1  i=1 

where  xy  =  Hx^  and 

Pfc  =  HPfcP'H'  =  HPfcH' 

=  H(I  —  Xfc(X/fcXfc)_1X/fc)H/ 

=  I  -  HX/c((HXfc)/(HX^.))_1(HXfc)/ 

A  similar  calculation  shows  that  the  distribution  of  X?  is  characterized  by 

fe+i  k 

T[exp{z  ^  t'xj}]  =  ^[exp{i^t'xi}exp{--t/fc+1PfctA;+1}],  (10) 

3= 1  3= 1 
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By  assumption,  Xfc  =  HXfc,  and  so  {xi, . . .  ,  Xfc,Pfc}  =  {xi, . . .  ,  Xfc,Pfc}  and  the  expectations  (9) 
and  (10)  are  equal.  Since  the  characteristic  functions  specify  the  distributions,  HXj+1  =  Xfc+1 
and  the  lemma  is  proved. 

Proof  of  Lemma  2:  The  vector  Ur*.+1i  is  constructed  as  Ur*.+1]  =  PfcZfc+i/|PfcZfc_|_i|.  P&  has 
m  —  k  eigenvalues  of  one,  the  rest  being  zero,  giving  the  eigenvalue  decomposition  Pfc  =  N^N^. 
where  Nfc  is  a  m  x  (m  —  k )  matrix  whose  columns  form  an  orthonormal  basis  for  the  null  space  of 
Ufc.  Substituting  in  NfcN^  for  Pfc  gives 


u[.fc+i] 


|NfcN'fczfe+1| 

NfeZfc+i 


N 


N  k 


Nfc- 


fc(z'NfcN,fcNfcN,fcz  )V2 

Nfcz fc+i 


(z'NfcN^z)1/2 

NfcZ 

N'fcz| 


Note  that  for  each  k,  Ufc  =  Xfc(X/fcXjt)-1/2,  and  so  the  projection  matrix  Pfc  can  be  written  as 
I  —  UfeU^,  a  function  of  Ufc.  Therefore,  given  Ufc,  Ur  fc+1i  is  equal  in  distribution  to  Nfc  (a  function 
of  Ufc)  multiplied  by  NfcZ/|Nj(z|.  The  distribution  of  N'fcz  can  be  found  via  its  characteristic 
function:  For  an  m  —  k-ve ctor  t 


-E'[exp{it/(Nfcz)}]  =  F;[exp{i(Nfct)'z}] 

=  exp{— ^t'NfcNfct} 

=  exp{-Vt}, 

and  so  we  see  that  NfcZfc+i  is  equal  in  distribution  to  an  m  —  k- vector  of  independent  standard 
normal  random  variables,  and  so  NfcZfc+i/|NfcZfc_|_i|  is  uniformly  distributed  on  on  the  m  —  fc-sphere. 


B  Expectation  of  eu 

In  this  section  we  compute  £,[eu  '^-v]  for  uniformly  distributed  unit  vectors  u  and  v  and  an  arbitrary 
mxn  matrix  A.  Integrating  with  respect  to  v  can  be  accomplished  by  noting  that  as  a  function  of 
v,  eu,^v  is  proportional  to  the  von  Mises-Fisher  distribution  on  the  n-sphere  Sn,  with  parameter 

u'A: 
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eu'Avp(v)  dSn(v) 


J  eu  Avcn(0)  d,Sn(v) 


Cn(0) 


Cn(||u'A||) 

=  r(n/2)(2/||u'A||)’*/2-1/„/2_1(|tu'A| 


where  Ij,  is  the  modified  Bessel  function  of  the  first  kind.  The  series  expansion  of  /n/2_i(||u/A||) 
gives 


r(„/2)(2/||u'A||)"/2-1/„/2_1(||u'A||)  =  £  ||u'A| 
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T(n/2) 


1=0 


T(Z  +  1)T(Z  +  n/2)Al 


All  the  terms  in  the  sum  are  positive,  so  i?[eu,Av]  can  be  found  by  replacing  1 1 u7 A|  |2Z  with  its 
expectation  in  the  above  equation.  To  compute  this  expectation,  let  A  =  LA1/2!?/  be  the  singular 
value  decomposition  of  A,  where  L'L  =  R/R  =  I  and  A  is  a  diagonal  matrix  of  the  eigenvalues  of 
A;A.  Then 


u'A 


2 


uAA;u 

u'LA1/2R'RA1/2L'u 

u,LAL,u 
u  Au 

n 

3= 1 


where  u  =  L'u.  We  will  now  identify  the  distribution  of  the  vector  {u2, . . .  ,  ft2}.  Let  B  =  {L,  L^} 
be  an  orthonormal  basis  for  Mm.  Since  the  uniform  distribution  on  the  sphere  is  rotationally 
invariant,  BTi  is  equal  in  distribution  to  u,  and  so  L'u  is  equal  in  distribution  to  the  first  n 
coordinates  of  u.  Recall  that  a  uniformly  distributed  vector  u  can  be  generated  by  sampling 
zi,...,zm  independently  from  a  standard  normal  distribution  and  then  dividing  each  term  by 
|  X>*2|1/2-  Therefore, 


U~, 2  f.2\  A  iflj  •  '  ~  ’  Zn I 

lu,l,  ■  ■  •  ,  unS  ^2 

Ej=i z]  ) 

=  0q 
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where  9  ~  beta(n/2,  (m  — n)/2),  q  ~  Dirichletn(l/2, . . . ,  1/2)  and  9  and  q  are  independent.  There¬ 
fore,  Hu'AII2  =  0A;q,  where  A  is  the  diagonal  of  A  and  are  the  eigenvalues  of  A' A.  The  required 
expectation  is  then 

£[||u'A||2Z]  =  E[9l]E[{  A'q)'] 

The  first  expectation  is  given  by  E[9l ]  =  [T(n/2  +  Z)T(m/2)]/[T(m/2  +  l)T(n/ 2)].  The  second 
expectation  is  the  /th-moment  of  a  Dirichlet  average,  which  results  in  a  type  of  multiple  hypergeo¬ 
metric  function  denoted  as  R{( A,  ^1).  This  expectation  and  its  generalizations  have  been  studied 
by  Carlson  (1977,  Chapter  5),  Dickey  (1983)  and  others.  An  algorithm  for  recursively  computing 
R\ ,  ... . ,  Rj  exactly  from  a  generating  function  is  provided  in  the  next  section. 

To  make  the  result  of  the  calculation  a  little  more  intuitive  let  A  =  A/^/)A j  =  A/||A||2,  and 
make  use  of  the  fact  that  ^[(A^)^  =  ||A||2ii?[(A  q)*],  Combining  the  results  gives 


E[e 


u'Avi 


E 

1=0 


A||2/£[(Aq)( 


T(m/2) 


T(m/2  +  l)T(l  +  Z)4l 


E 

;=o 


1 2 1, 


ai, 


and  so  we  see  how  the  expectation  is  related  to  the  norm  of  A  via  ||A||2i  and  the  variability  in 
relative  sizes  of  the  squared  singular  values  via  E[( A  q)z]. 

To  get  bounds  on  a  finite-sum  approximation  to  £’[eu,'^v],  note  that  Aaiin  <  L^A'q)*]  <  Aaiax 
so 


OO 

E 

l=r+ 1 


T(m/2) 

T(m/2  +  Z)T(l  +  04' 


< 


OO 

E 

l=r+ 1 


Fh.VnVl  r(rn/2) 

K  q)JT(m/2  +  0T(l  +  04' 


< 


E 

l=r+ 1 


T(m/2) 

T(m/2  +  0T(l  +  /)4z 


The  outer  sums  can  be  computed  as 


E  A‘ 

l—r+l 


T(m/2) 


T(m/2  +  l)T(l  +  l)4l 


2 

71 


m/2—  1 


and  so  bounds  on  iT[eu,^v]  —  ^7=0  ||A||2icq  can  be  obtained,  at  least  to  the  same  precision  with 
which  one  can  compute  the  modified  Bessel  function  Im/2-  -i(VX). 


C  Computing  ^[(A'qt 


Let  q  ~  Dirichletn(ai, . . . ,  an).  Carlson  (1977,  Section  6.6)  shows  that 


ID  - 

2=1 


OO 


E 

1=0 


T(q'1  +  1) 
T{a'l)T{l  +  1) 


tlE[(  A'q)*]. 


Let  ci  =  F(Q^7(;^i)-^'[(A/q)7  We  now  show  how  to  calculate  c^+i  based  on  ci,...,Cfc.  Let 
f(t)  =  )C7o  cRl  be  the  right-hand  side  of  the  equation  and  g(t)  =  —  Ya= i  ai  log(l  —  t\)  be  the  log 
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of  the  left-hand  side.  Taking  derivatives  with  respect  to  t  and  evaluating  at  zero  we  have 


/W(o)  =  T(l  +  l)q, 


Since  f(t)  =  e9^\  we  have 


g«\0)  =  r(l)J2<XiK- 

2—1 


/(fe+i)(°)  =  ^Q/(0(%(fe+i-0(o). 

Plugging  the  values  of  /®( 0)  into  the  sum  gives 

k  ' 

°k+ 1  =  y~]  Q 

z=o  . 

Simplifying  gives 

k 

E[(A'q)l+1]=^|B[(Vq) 


*\  r((  +  i)r(fc  +  i  -  ()  y, 
l )  T(k  +  2)  1  ^ 


a,;  A, 


fc+i-i 


vi=l 


i=o  L 


r(l'o;  +  l)T(k  +  1) 


r(i'Q!  +  k  + 1) 


<2=1 


C-code  with  an  R-interface  to  calculate  {£[(A  q)  :  l  =  0, . . . ,  k}  is  available  at  my  website. 
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